Abstract: This letter presents a perceptually weighted
analysis-by-synthesis vector quantization (VQ) algorithm for a low
bit rate MFCC codec. Unlike conventional VQ of the MFCCs vector,
this algorithm uses an analysis-by-synthesis technique and aims
to minimize the perceptually weighted spectral reconstruction
distortion rather than the distortion of the MFCCs vector itself.
To reduce the computational complexity, we also propose a
practical suboptimal codebook searching technique and embed
it into the split and multistage vector quantization framework.
Objective and subjective experimental results for Mandarin
speech show that the proposed algorithm yields intelligible and
natural sounding speech for speech coding at 600-2400 bit/s.
Compared to the current VQ in the MFCC codec, the output speech
quality is substantially improved in terms of frequency-weighted
segmental SNR, STOI, PESQ and MOS.


Index Terms—MFCCs, vector quantization, speech coding, 
analysis-by-synthesis. 


I. INTRODUCTION 


The MFCC codec attempts to encode the speech signal through
quantization of mel-frequency cepstral coefficients (MFCCs), which
provides a promising new approach for speech coding at 600-4800
bit/s. The speech quality of the MFCC codec even exceeds that of
the state-of-the-art MELPe codec [1]-[2]. Also, the high-resolution
MFCCs vector encoded in the MFCC codec can be easily down-converted
to a low-resolution MFCCs vector for distributed speech recognition
(DSR) in the ETSI Aurora DSR standard [3]. Although the MFCC codec
yields natural sounding speech, there is still room for further
improving the speech quality. For example, spectrum smearing occurs
because the triangular frequency windows used for MFCCs extraction
overlap. Moreover, the quantization of the MFCCs vector further
aggravates this problem, which degrades the articulation of the
coded speech.

Vector quantization (VQ) plays an important role in reducing
the bit rate of speech coding. However, the quantization error
increases rapidly as the number of quantization bits decreases
[4], which is the main cause of the degradation of speech
quality. In the current MFCC codec, the MFCCs vector is directly
quantized with the objective of minimizing its own quantization
distortion, i.e., the codeword of minimum square error is selected
as the quantized vector [2]. Yet this conventional VQ method does
not directly reflect the effect of quantization on the final speech
distortion. Inspired by the perceptual weighting technique used in
low bit rate codecs [5]-[6], which considers the auditory masking
properties of the human ear, we change the objective of VQ of the
MFCCs vector: the codeword of minimum perceptually weighted
spectral reconstruction distortion is selected as the quantized
MFCCs vector. To achieve this objective, we propose a new
framework for VQ of the MFCCs vector that minimizes the end-to-end
perceptually weighted spectral distortion of the speech signal
rather than the quantization distortion of the MFCCs vector. The
proposed method uses a closed-loop technique known as
analysis-by-synthesis (AbS), which has been broadly used for
speech coding and the quantization of compressed sensing
measurements [7]-[10]. The synthesis step employs a speech power
spectrum reconstruction technique to measure the effect of MFCCs
vector quantization on the final speech quality, and the analysis
step, performed after the synthesis step, selects the codeword that
minimizes the perceptually weighted spectral distortion. To the
best of our knowledge, the perceptually weighted AbS approach has
not previously been used for VQ of the MFCCs vector. It is shown
to provide much better speech quality than the conventional VQ
method, which directly quantizes the MFCCs vector itself in an
open-loop fashion. Since AbS requires a higher computational cost,
a low complexity suboptimal codebook searching technique is also
proposed.

Notations: for the $k$th speech frame ($k = 1, 2, \ldots, K$), $y_k$
and $f_k$ denote the power spectrum and MFCCs vectors, $\hat{y}_k$ and
$\hat{f}_k$ denote their quantized values, and $W_k$ denotes the
perceptual weighting matrix. $F = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_J\}$
and $F_s = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_I\}$ denote the codebook
and suboptimal codebook of the MFCCs vector, respectively.
$\hat{f}_j$, $j = 1, 2, \ldots, J$, denotes the codeword (also called
centroid in this paper). $\hat{y}_j$ denotes the power spectrum
reconstructed from $\hat{f}_j$. $C_j$, $j = 1, 2, \ldots, J$, denotes the
$j$th cluster in the codebook training phase.


II. ANALYSIS-BY-SYNTHESIS VECTOR QUANTIZATION 
A. Mel-frequency cepstral coefficients 

The extraction of MFCCs begins with framing the speech
waveform $x(n)$ with a window $w(n)$,

$$x_m(n) = x(mR + n)\, w(n) \tag{1}$$

where $N$ is the window length ($0 \le n \le N - 1$), $R$ is the
frame shift and $m$ is the frame index. Then, the speech frame
can be concisely denoted as,

$$x_m = [x_m(0), x_m(1), \ldots, x_m(N-1)]^T \tag{2}$$

The power spectrum of each speech frame is,

$$y = |\mathcal{F}\{x\}|^2 \tag{3}$$

where $\mathcal{F}\{x\}$ is the $N$-point FFT of $x$ and $|\cdot|$ denotes
the modulus of a complex number.

The latter $N/2 - 1$ elements of $y$ are discarded due to
symmetry. Then, the power spectrum is Mel-filtered by a
set of weighting functions, i.e., the Mel-scale weighting matrix
$\Phi \in \mathbb{R}^{M \times (N/2+1)}$, where $M$ is the number of
Mel-filter bands. Generally, $\Phi$ is designed based on human
perception of pitch frequency and implemented as a bank of filters,
each with a triangular frequency response [2]. Finally, the
MFCCs vector is computed through the $\log(\cdot)$ and discrete
cosine transform (DCT),

$$f = \mathrm{DCT}\{\log(\Phi y)\} = D \log(\Phi y) \tag{4}$$

where $D$ is the $M \times M$ DCT matrix.
The power spectrum $y$ can be approximately reconstructed
from the MFCCs vector $f$ as follows [1]-[2],

$$\hat{y} = \Phi^{\dagger} \exp(D^{-1} f) = (\Phi^T \Phi)^{-1} \Phi^T \exp(D^{-1} f) \tag{5}$$

where $\Phi^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $\Phi$.
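
As a rough illustration of (1)-(5), the following NumPy sketch runs one frame through the MFCC analysis and the pseudo-inverse reconstruction. The toy sizes (N = 64, M = 8), the Hamming window, the particular triangular filter-bank construction and all function names are assumptions made for the example, not details given in this letter.

```python
import numpy as np

def mel_filterbank(M, N, fs):
    """Toy M x (N/2+1) triangular Mel filter bank (one assumed design)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(0.0, hz2mel(fs / 2.0), M + 2))  # band edges in Hz
    freqs = np.arange(N // 2 + 1) * fs / N                     # FFT bin frequencies
    Phi = np.zeros((M, N // 2 + 1))
    for m in range(M):
        lo, ce, hi = edges[m], edges[m + 1], edges[m + 2]
        rise = (freqs - lo) / (ce - lo)
        fall = (hi - freqs) / (hi - ce)
        Phi[m] = np.maximum(0.0, np.minimum(rise, fall))
    return Phi

def dct_matrix(M):
    """Orthonormal M x M DCT-II matrix D, so that D^{-1} = D^T."""
    k = np.arange(M)[:, None]
    n = np.arange(M)[None, :]
    D = np.sqrt(2.0 / M) * np.cos(np.pi * (n + 0.5) * k / M)
    D[0] /= np.sqrt(2.0)
    return D

def power_spectrum(frame):
    """Eq. (1)-(3): window the frame and keep the first N/2+1 power bins."""
    windowed = frame * np.hamming(len(frame))  # window choice is an assumption
    return np.abs(np.fft.rfft(windowed)) ** 2

def mfcc(y, Phi, D):
    """Eq. (4): f = D log(Phi y)."""
    return D @ np.log(Phi @ y)

def reconstruct(f, Phi, D):
    """Eq. (5): y_hat = pinv(Phi) exp(D^{-1} f)."""
    return np.linalg.pinv(Phi) @ np.exp(D.T @ f)

N, M, fs = 64, 8, 8000
Phi, D = mel_filterbank(M, N, fs), dct_matrix(M)
frame = np.random.default_rng(0).standard_normal(N)
y = power_spectrum(frame)
f = mfcc(y, Phi, D)
y_hat = reconstruct(f, Phi, D)
```

In this toy setup the reconstruction is only approximate: `y_hat` reproduces the Mel-band energies `Phi @ y` of the original frame, but not its fine spectral detail, because M is smaller than N/2 + 1.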


B. Distance measure for VQ of MFCCs vector 


Conventionally, the MFCCs vector $f$ is directly quantized
using the Euclidean distance measure [2],

$$d(f, \hat{f}) = \|f - \hat{f}\|_2^2 \tag{6}$$

Here, we instead consider the perceptual weighting filter $P(z)$
in (7), where $H(z) = 1 / \left(1 - \sum_{i=1}^{p} a_i z^{-i}\right)$ is the
linear prediction (LP) synthesis filter, $a_i$ are the short-term LP
coefficients and $p$ is the prediction order. $0 < \gamma < \beta < 1$
are the perceptual weighting factors that control the energy of the
error in the formant regions [11]. Referring to the EVRC codec, we
choose $\beta = 0.9$ and $\gamma = 0.5$ [12].

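Let $P(\omega)$ denote the frequency response of the perceptual
weighting filter defined in (7),

$$P(\omega) = P(z)\big|_{z = e^{j\omega}} \tag{8}$$

In the spectral domain, the perceptual weighting matrix $W$
can be expressed as a diagonal matrix,

$$W = \begin{bmatrix} w_0 & 0 & \cdots & 0 \\ 0 & w_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_{N/2} \end{bmatrix} \tag{9}$$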


here, 


i=0,1,...,N/2. (10) 


As mentioned above, N is the FFT length. 


Consequently, the perceptually weighted spectral distortion
between the original power spectrum $y$ and the quantized
power spectrum $\hat{y}$ can be expressed as,

$$d(y, \hat{y}) = \sum_{i=0}^{N/2} w_i (y_i - \hat{y}_i)^2 = (y - \hat{y})^T W (y - \hat{y}) \tag{11}$$
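
The weighting and distortion above can be sketched in a few lines of NumPy. Since the exact forms of (7) and (10) are not legible in this text, the sketch assumes the common CELP-style filter P(z) = A(z/beta) / A(z/gamma) with A(z) = 1/H(z), and assumes w_i = |P(omega_i)|^2 at omega_i = 2*pi*i/N; both choices, the toy first-order LP coefficient and the function names are assumptions for illustration only.

```python
import numpy as np

def perceptual_weights(a, n_fft, beta=0.9, gamma=0.5):
    """Diagonal of W on the N/2+1 FFT bins.

    ASSUMED form: P(z) = A(z/beta) / A(z/gamma), A(z) = 1 - sum_i a_i z^-i,
    and w_i = |P(omega_i)|^2. The letter's own (7) and (10) may differ.
    """
    a = np.asarray(a, dtype=float)
    i = np.arange(1, len(a) + 1)
    omega = 2.0 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    z_inv = np.exp(-1j * np.outer(omega, i))   # e^{-j omega i}, shape (bins, p)
    num = 1.0 - z_inv @ (a * beta ** i)        # A(z/beta) on the unit circle
    den = 1.0 - z_inv @ (a * gamma ** i)       # A(z/gamma) on the unit circle
    return np.abs(num / den) ** 2

def weighted_distortion(y, y_hat, w):
    """Eq. (11): sum_i w_i (y_i - y_hat_i)^2."""
    err = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sum(w * err ** 2))

a = [0.9]                                      # toy first-order LP model
w = perceptual_weights(a, n_fft=240)           # 121 weights for N = 240
rng = np.random.default_rng(1)
y = rng.random(121)
y_hat = y + 0.01 * rng.standard_normal(121)
d = weighted_distortion(y, y_hat, w)
```

Storing only the diagonal of W keeps the distortion an element-wise product; under the assumed filter form, the weights of this toy model are smallest near the spectral peak of the LP envelope (omega = 0).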


C. Overview of perceptually weighted AbS VQ 


As shown in Fig. 1, the proposed perceptually weighted
AbS VQ mainly consists of two steps: a synthesis step that
reconstructs the speech power spectrum from the MFCCs
codeword $\hat{f}_j$, and an analysis step that extracts the power
spectrum of the $k$th speech frame and calculates the perceptually
weighted spectral distortion between $y_k$ and $\hat{y}_j$. These
two steps are repeated $J$ times to search the whole
codebook $F$. Finally, the codeword of minimum $d(y_k, \hat{y}_j)$ is
selected as the quantized MFCCs vector.


[Figure: block diagram. Recoverable block labels: speech frame, FFT,
LP analysis, H(z) to P(z), MFCCs vector codebook F, MFCCs to power
spectrum, bit stream.]


Fig. 1. Diagram of perceptually weighted AbS VQ. 
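
The exhaustive search described above can be sketched as a plain loop over the codebook. The function name, the `synth` callable and the toy stand-in `np.exp` used for the synthesis step are assumptions for illustration; in the codec the synthesis step would be the reconstruction in (5).

```python
import numpy as np

def abs_full_search(y_k, w_k, codebook, synth):
    """Return (index, distortion) of the codeword minimizing Eq. (11)."""
    best_j, best_d = -1, np.inf
    for j, f_j in enumerate(codebook):
        y_j = synth(f_j)                            # synthesis step
        d = float(np.sum(w_k * (y_k - y_j) ** 2))   # analysis step
        if d < best_d:
            best_j, best_d = j, d
    return best_j, best_d

toy_synth = np.exp    # hypothetical stand-in for pinv(Phi) exp(D^{-1} f)
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_k = np.exp(np.array([0.9, 0.1]))   # spectrum close to that of codeword 1
w_k = np.ones(2)                     # flat weights for the toy case
j_star, d_star = abs_full_search(y_k, w_k, codebook, toy_synth)
```

Each call costs J synthesis steps and J distortion evaluations per frame, which is the cost that grows with the codebook size.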


D. Codebook training 


To train the codebook F, there are two problems to resolve. 
The first is how to assign each training sample into a cluster, 
the second is how to determine the centroid of each cluster. 
Conventionally, the first can be resolved via the nearest neigh- 
bor criterion. However, the second needs further derivations. 

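Let $\hat{y}_j$ denote the power spectrum reconstructed from the
centroid $\hat{f}_j$ of the $j$th cluster. For any training sample $y_k$,

$$d(y_k, \hat{y}_j) = d\left(y_k, \Phi^{\dagger} \exp(D^{-1} \hat{f}_j)\right)$$

Then, $y_k$ can be assigned to a cluster $C_j$ according to the
nearest neighbor criterion,

$$C_j = \{ y_k \mid d(y_k, \hat{y}_j) < d(y_k, \hat{y}_i),\ k = 1, 2, \ldots, K,\ i = 1, 2, \ldots, J,\ i \neq j \}$$

Consequently, for $K$ training speech frames, the total
distortion is,

$$e_t = \sum_{k=1}^{K} d(y_k, \hat{y}_k) = \sum_{j=1}^{J} \sum_{y_k \in C_j} d(y_k, \hat{y}_j) \tag{12}$$

Let

$$e_j = \sum_{y_k \in C_j} d(y_k, \hat{y}_j) \tag{13}$$

denote the total distortion associated with all samples which belong
to the $j$th cluster; then $e_t$ can be concisely represented as,

$$e_t = \sum_{j=1}^{J} e_j \tag{14}$$

With respect to the centroid $\hat{f}_j$, $e_t$ can be minimized,

$$\hat{f}_j = \arg\min_{\hat{f}_j} e_t \tag{15}$$

Since only $e_j$ depends on $\hat{f}_j$,

$$\frac{\partial e_t}{\partial \hat{f}_j} = \frac{\partial \sum_{j=1}^{J} e_j}{\partial \hat{f}_j} = \frac{\partial e_j}{\partial \hat{f}_j} \tag{16}$$

Applying the chain rule and setting $\partial e_t / \partial \hat{f}_j = 0$, we have,

$$\frac{\partial\, \Phi^{\dagger} \exp(D^{-1} \hat{f}_j)}{\partial \hat{f}_j} \cdot \frac{\partial e_j}{\partial\, \Phi^{\dagger} \exp(D^{-1} \hat{f}_j)} = 0 \tag{17}$$

Recalling the definition of $e_j$, it is apparent that the
solution of (18) is also the solution of (17),

$$\frac{\partial e_j}{\partial\, \Phi^{\dagger} \exp(D^{-1} \hat{f}_j)} = 0 \tag{18}$$

that is,

$$\Big[ \sum_{y_k \in C_j} W_k \Big] \Phi^{\dagger} \exp(D^{-1} \hat{f}_j) = \sum_{y_k \in C_j} W_k y_k \tag{19}$$

Since $W_k$ is a diagonal non-zero matrix, $\sum_{y_k \in C_j} W_k$ is
non-singular. Hence, the optimal centroid for the $j$th cluster is,

$$\hat{f}_j = D \log\Big( \Phi \Big[ \sum_{y_k \in C_j} W_k \Big]^{-1} \sum_{y_k \in C_j} W_k y_k \Big) \tag{20}$$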


In summary, the codebook training algorithm for perceptually
weighted AbS VQ is shown in Algorithm 1.
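
The two alternating steps of the training procedure can be sketched as follows. This is a simplified illustration, not the letter's exact procedure: centroids are kept in the power-spectrum domain (the mapping to MFCCs via D log(Phi .) and back through the pseudo-inverse is omitted), the codebook is initialized from random training samples, and empty clusters keep their previous centroid. The function name and the toy data are assumptions as well.

```python
import numpy as np

def train_spectral_codebook(Y, W, J, T=50, delta=0.1, seed=0):
    """Y: (K, B) power spectra; W: (K, B) diagonals of W_k (all > 0)."""
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), size=J, replace=False)].copy()  # assumed init
    history = []
    for n in range(T):
        # Assignment: d[k, j] = sum_b w_kb * (y_kb - c_jb)^2, as in (11).
        diff = Y[:, None, :] - C[None, :, :]
        d = np.einsum("kb,kjb->kj", W, diff ** 2)
        labels = d.argmin(axis=1)
        # Centroid update: (sum W_k)^{-1} sum W_k y_k, element-wise
        # because every W_k is diagonal.
        for j in range(J):
            members = labels == j
            if members.any():            # empty clusters keep their centroid
                C[j] = (W[members] * Y[members]).sum(axis=0) / W[members].sum(axis=0)
        # Total distortion with the updated centroids.
        e_t = float(np.sum(W * (Y - C[labels]) ** 2))
        history.append(e_t)
        if n > 0 and abs(history[-2] - e_t) <= delta:
            break
    return C, labels, history

rng = np.random.default_rng(0)
Y = rng.random((200, 5)) + 0.1      # toy "power spectra"
W = rng.random((200, 5)) + 0.5      # toy positive weights
C, labels, history = train_spectral_codebook(Y, W, J=4, delta=1e-6)
```

In this simplified form both steps can only lower the total distortion, so `history` is non-increasing and the loop stops once the change falls below `delta` or after `T` iterations.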


E. Low complexity suboptimal codebook searching 


Different from the codebook training phase, the codebook
search must be performed online, so the computational
complexity becomes a major issue. In the primary perceptually
weighted AbS VQ scheme in Fig. 1, we must reconstruct the
speech power spectrum from each codeword $\hat{f}_j$ in $F$ and
calculate the corresponding perceptually weighted spectral
distortion. The computational complexity is too high because the
number of codewords in $F$, i.e., $J$, is very large. Consequently,
we propose a low complexity suboptimal codebook searching
technique, as shown in Fig. 2. Through the conventional VQ of the
MFCCs vector $f_k$, $I$ optimal candidate codewords are selected
from $F$ to constitute the suboptimal codebook $F_s$. For
$\forall \hat{f}_i \in F_s$, $\forall \hat{f}_l \in F \setminus F_s$,

$$\| f_k - \hat{f}_i \|_2^2 < \| f_k - \hat{f}_l \|_2^2 \tag{21}$$

Only the codewords in $F_s$ are used for the reconstruction
of the speech power spectrum and the calculation of the perceptually
weighted spectral distortion, so the computational complexity
is reduced dramatically, since we choose $I \ll J$.
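
A minimal sketch of this two-pass search is given below, assuming the spectra reconstructed from all codewords have been precomputed offline and stored in `Y_hat` (a design choice of the sketch, not something stated in the letter). The function name and the random toy data are assumptions; in particular the toy `Y_hat` is not derived from `F`.

```python
import numpy as np

def abs_suboptimal_search(f_k, y_k, w_k, F, Y_hat, I):
    """Pick I codewords by MFCC distance (21), then apply Eq. (11) to them."""
    d_mfcc = np.sum((F - f_k) ** 2, axis=1)            # conventional VQ distance
    candidates = np.argpartition(d_mfcc, I - 1)[:I]    # the I nearest, unordered
    d_spec = np.sum(w_k * (Y_hat[candidates] - y_k) ** 2, axis=1)
    return int(candidates[np.argmin(d_spec)])

rng = np.random.default_rng(0)
J, M, B = 32, 6, 10
F = rng.standard_normal((J, M))       # toy MFCC codebook
Y_hat = rng.random((J, B))            # toy spectra, one row per codeword
f_k = rng.standard_normal(M)
y_k = rng.random(B)
w_k = rng.random(B) + 0.5
best = abs_suboptimal_search(f_k, y_k, w_k, F, Y_hat, I=4)
```

With I = J the function coincides with the exhaustive AbS search, and with I = 1 it returns the conventional nearest MFCC codeword, so I sets the trade-off between the two.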
Algorithm 1 Codebook training algorithm.

Input: training samples $Y = \{y_1, y_2, \ldots, y_K\}$, weighting
matrices $W = \{W_1, W_2, \ldots, W_K\}$
Output: MFCCs codebook $F = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_J\}$
1: Initialization: $n = 1$, $e_t^{(0)} = 0$, $T = 50$, $\delta = 0.1$, $F$ is
initialized randomly.
2: while $n < T$, $\Delta e_t > \delta$ do
3: // Line 4 reconstructs the power spectrum $\hat{y}_j$ from $\hat{f}_j$:
4: $\hat{y}_j = \Phi^{\dagger} \exp(D^{-1} \hat{f}_j)$
5: // Line 6 assigns each sample $y_k$ into a cluster $C_j$:
6: $C_j = \{ y_k \mid d(y_k, \hat{y}_j) < d(y_k, \hat{y}_i),\ i = 1, 2, \ldots, J,\ i \neq j \}$
7: // Line 8 updates the cluster centroid $\hat{f}_j$:
8: $\hat{f}_j = D \log\big( \Phi \big[ \sum_{y_k \in C_j} W_k \big]^{-1} \sum_{y_k \in C_j} W_k y_k \big)$
9: // Line 10 computes the total distortion $e_t$:
10: $e_t^{(n)} = \sum_{j=1}^{J} \sum_{y_k \in C_j} \big\| W_k^{1/2} \big( y_k - \Phi^{\dagger} \exp(D^{-1} \hat{f}_j) \big) \big\|_2^2$
11: // Line 12 computes the variation of $e_t$, $\Delta e_t$:
12: $\Delta e_t = | e_t^{(n)} - e_t^{(n-1)} |$
13: $n = n + 1$
14: end while

Fig. 2. Diagram of low complexity suboptimal codebook searching.


F. AbS based split and multistage VQ 


Furthermore, in practical applications, split VQ (SVQ) or
multistage VQ (MSVQ) is usually adopted as an alternative to
direct VQ to reduce the storage and computational complexity.
To embed AbS VQ into the SVQ and MSVQ framework, we keep $Q$
optimal candidate codewords in each sub-vector codebook or
sub-stage codebook to constitute the final suboptimal codebook
$F_s$. Note that there is only a single-stage codebook at a speech
coding rate of 600 bit/s because of the limited number of
quantization bits. As shown in Tab. I, we design a four-stage AbS
vector quantizer to quantize the formant and pitch information for
speech coding at 2400 bit/s, where each stage has 4096 codewords.
For speech coding at 1200 bit/s, a two-stage AbS vector quantizer
is designed, where the first stage has 4096 codewords and the
second stage has 2048 codewords. The bit allocation scheme of AbS
SVQ is the same as that in [2]. Therefore, $F_s$ consists of $Q^4$,
$Q^2$ and $Q$ codewords for speech coding at 2400, 1200 and 600
bit/s, respectively. If $Q = 1$, the AbS SVQ method reduces to
the conventional VQ method in [1]-[2].
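
For the split case, one plausible way to build the candidate set is sketched below: each sub-vector keeps its Q nearest sub-codewords, and the suboptimal codebook is formed from all concatenations. The function name, the toy sub-codebooks and the use of a Cartesian product are assumptions for illustration; the letter does not spell out this construction.

```python
import itertools
import numpy as np

def svq_candidates(f_k, sub_codebooks, Q):
    """Return the Q^S candidate vectors for S sub-vector codebooks."""
    parts, start = [], 0
    for cb in sub_codebooks:                      # cb has shape (J_s, dim_s)
        dim = cb.shape[1]
        sub = f_k[start:start + dim]              # matching slice of f_k
        d = np.sum((cb - sub) ** 2, axis=1)       # Euclidean distance per split
        parts.append(cb[np.argsort(d)[:Q]])       # keep the Q nearest
        start += dim
    # Every combination of one kept sub-codeword per split.
    return np.array([np.concatenate(c) for c in itertools.product(*parts)])

rng = np.random.default_rng(0)
sub_codebooks = [rng.standard_normal((16, 3)), rng.standard_normal((16, 2))]
f_k = rng.standard_normal(5)
cands = svq_candidates(f_k, sub_codebooks, Q=3)   # 3^2 = 9 candidates
```

Each row of `cands` would then go through the synthesis and weighted-distortion steps; with Q = 1 the single candidate is the conventional split-VQ output.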
TABLE I
BIT ALLOCATION OF ABS MULTISTAGE VQ.

Rate      Bits/    Energy      Formant & Pitch
(bit/s)   Frame    (C1)        (C2-C60)
2400      54       6-bit SQ    (12-12-12-12)-bit AbS MSVQ
1200      27       4-bit SQ    (12-11)-bit AbS MSVQ


III. EXPERIMENTS AND RESULTS 
A. Dataset and evaluation metrics 


We use 4000 selected utterances spoken by 40 speakers (20
males and 20 females) from the CASIA Chinese database [13] to
train the MFCCs codebook. The training speech lasts ~3.5 h and
contains 1,682,162 speech frames (30 ms frames, 7.5 ms frame
shift). Another 96 selected utterances from the same database,
spoken by 16 unseen speakers (8 males and 8 females), are used
as the test set. The test speech lasts ~5 min. Both the training
speech and the test speech are downsampled to 8 kHz. We choose
$N = 240$, $M = 60$ and $p = 10$ to extract the MFCCs vector and
perform LP analysis.

We use three metrics to evaluate the quality of coded speech, 
which are frequency-weighted segmental SNR (fwsegSNRs) 
[14], perceptual evaluation of speech quality (PESQ) [15] and 
short-time objective intelligibility (STOI) [16]. FwsegSNRs 
and PESQ measures illustrate the overall speech quality while 
the STOI measure illustrates the speech intelligibility. 


B. Objective evaluation of speech quality 


The results of the objective evaluation are shown in Tables II-IV,
in which the results of the conventional VQ method (AbS SVQ with
$Q = 1$) are underlined while the best results are highlighted in
bold. The proposed AbS VQ method yields substantially higher
fwsegSNRs, PESQ and STOI scores than the conventional VQ method,
which demonstrates that the speech quality is much better.
Moreover, $F_s$ approaches $F$ as $Q$ increases, so the speech
quality continues to improve. In particular, for speech coding at
2400 bit/s, the average fwsegSNRs, PESQ and STOI scores are
improved by 1.2 dB, 0.12 and 2%, respectively.

In addition, AbS MSVQ consistently yields better results than
AbS SVQ because it fully exploits the redundancy between the
components of the MFCCs vector. As mentioned above, a larger $Q$
leads to more codewords in $F_s$; hence, to trade off speech
quality against computational complexity, we choose $Q = 2, 4, 5$
for speech coding at 2400, 1200 and 600 bit/s, respectively.
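
The figures quoted above and in Tab. I can be cross-checked with a few lines of arithmetic. The dictionary layout and variable names are illustrative assumptions; the numbers themselves are taken from the text.

```python
# Bit allocation from Tab. I and the chosen Q for each rate.
schemes = {
    2400: {"energy_bits": 6, "stage_bits": [12, 12, 12, 12], "Q": 2},
    1200: {"energy_bits": 4, "stage_bits": [12, 11], "Q": 4},
}

summary = {}
for rate, s in schemes.items():
    summary[rate] = {
        # Energy bits plus all MSVQ stage bits.
        "bits_per_frame": s["energy_bits"] + sum(s["stage_bits"]),
        # Codewords per stage: 2^bits.
        "stage_sizes": [2 ** b for b in s["stage_bits"]],
        # Size of the suboptimal codebook: Q^(number of stages).
        "candidates": s["Q"] ** len(s["stage_bits"]),
    }

# 600 bit/s: a single stage with Q = 5 (its bit allocation is not in Tab. I).
candidates_600 = 5 ** 1

for rate, row in summary.items():
    print(rate, row)
```

With these choices the AbS search evaluates 16, 16 and 5 candidate codewords per frame at 2400, 1200 and 600 bit/s, respectively.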


TABLE II
COMPARISON ON THE FWSEGSNRS (DB).

Rate      Quantization   Q = 1   Q = 2   Q = 3   Q = 4   Q = 5
(bit/s)   method
2400      AbS SVQ        13.74   14.20   14.42   14.52   14.56
2400      AbS MSVQ       13.84   14.55   14.75   14.87   14.94
1200      AbS SVQ        11.91   12.25   12.44   12.51   12.56
1200      AbS MSVQ       12.05   12.41   12.56   12.65   12.70
600       AbS SVQ        10.75   10.91   10.97   11.01   11.02


TABLE III
COMPARISON ON THE STOI SCORE (%).

Rate      Quantization   Q = 1   Q = 2   Q = 3   Q = 4   Q = 5
(bit/s)   method
2400      AbS SVQ        91.15   92.31   92.64   92.77   92.89
2400      AbS MSVQ       91.48   92.69   92.98   93.04   93.35
1200      AbS SVQ        88.24   89.52   89.83   89.98   90.02
1200      AbS MSVQ       88.32   89.58   89.86   90.12   90.17
600       AbS SVQ        84.98   85.86   85.89   86.00   86.07

TABLE IV
COMPARISON ON THE PESQ SCORE.

Rate      Quantization   Q = 1   Q = 2   Q = 3   Q = 4   Q = 5
(bit/s)   method
2400      AbS SVQ        3.22    3.25    3.28    3.29    3.32
2400      AbS MSVQ       3.23    3.30    3.32    3.34    3.35
1200      AbS SVQ        2.91    2.98    2.99    3.01    3.02
1200      AbS MSVQ       2.94    3.00    3.01    3.02    3.03
600       AbS SVQ        2.63    2.66    2.66    2.67    2.68


C. Subjective listening test 


Fourteen native Chinese volunteers (7 males and 7 females)
were invited to participate in a subjective listening test. Each
volunteer was asked to rate the coded speech on the standard
five-point mean opinion score (MOS) scale [17]. Each volunteer
was presented with two speech files (one male and one female)
encoded by the VQ and AbS MSVQ based MFCC codecs at 2400, 1200
and 600 bit/s, respectively. The results of the subjective
listening test are given in Tab. V and agree well with the
PESQ evaluation.


TABLE V
SUBJECTIVE EVALUATION RESULTS.

Rate (bit/s)   2400          1200          600
AbS MSVQ       3.24±0.10     2.92±0.12     2.70±0.15
VQ [1]-[2]     3.16±0.15     2.82±0.14     2.65±0.15


D. Discussion on further performance improvement 


Instead of utilizing the Moore-Penrose pseudo-inverse to
reconstruct the speech power spectrum in (5), the synthesis stage
of AbS VQ could use newer reconstruction algorithms [18]-[19],
which is expected to further improve the speech quality.
Furthermore, the speech waveform could be reconstructed
from the power spectrum using the improved algorithm in [20]
rather than the inverse short-time Fourier transform magnitude
algorithm in [21]; by taking into account the differences between
voiced and unvoiced speech, this should also yield higher speech
quality.


IV. CONCLUSION 


In this paper, we propose a perceptually weighted AbS VQ
approach for the low bit rate MFCC codec. The objective of VQ
is changed to minimize the perceptually weighted spectral
reconstruction distortion rather than the distortion of the MFCCs
vector itself. A suboptimal codebook searching technique is
proposed for practical implementation. Objective and subjective
tests show that the speech quality is substantially improved
compared to the output of the current MFCC codec.